In this notebook, we examine data on COVID-19, the disease caused by the novel coronavirus (nCoV) behind the 2020 pandemic. The data is collected globally for every country (and, for some countries such as China, for each region) and is updated every 24 hours.
The dataset is available here: https://www.kaggle.com/imdevskp/corona-virus-report
Our goal is to determine the mortality rate per country over time and to visualise the curves of confirmed cases, deaths, recovered cases and active cases.
Finally, we will forecast our time series with Prophet to analyse the curves of new infections and deaths in the UK versus the US.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap, LinearSegmentedColormap
import plotly.express as px
# Flourish embed helpers (display and HTML live in IPython.display; IPython.core.display is deprecated for this)
from IPython.display import HTML, Javascript, display
covid_19 = pd.read_csv('data/covid_19_clean_complete.csv', parse_dates=['Date'])
covid_19.head()
df1 = covid_19.groupby(["Date", "Country/Region"])['Confirmed'].sum().reset_index()
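# one row per country, one column per date; values= avoids a two-level column header
df_fluorish = df1.pivot(index='Country/Region', columns='Date', values='Confirmed').reset_index()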
# df_fluorish.to_csv(r'data/covid_19_merged.csv', index = False)
We will only focus on:
We can ignore the coordinates, fill in missing values and, where regional data is available, aggregate each country's regions.
# cases
cases = ['Confirmed', 'Deaths', 'Recovered', 'Active']
# Active Case = confirmed - deaths - recovered
covid_19['Active'] = covid_19['Confirmed'] - covid_19['Deaths'] - covid_19['Recovered']
# replacing Mainland china with just China
covid_19['Country/Region'] = covid_19['Country/Region'].replace('Mainland China', 'China')
# filling missing values
covid_19[['Province/State']] = covid_19[['Province/State']].fillna('')
covid_19[cases] = covid_19[cases].fillna(0)
# fixing datatypes
covid_19['Recovered'] = covid_19['Recovered'].astype(int)
Next, we get the latest date and sum all countries to find confirmed cases, deaths and mortality rate globally.
The overall reported mortality rate is about 5% (a deaths-to-confirmed ratio of roughly 0.05) as of 31/3/2020.
# latest
latest = covid_19[covid_19['Date'] == max(covid_19['Date'])].reset_index()
# latest condensed
# select several columns with a list (double brackets); tuple-style selection fails on pandas >= 1.0
latest_grouped = latest.groupby('Country/Region')[['Confirmed', 'Deaths', 'Recovered', 'Active']].sum().reset_index()
total = covid_19.groupby(['Country/Region', 'Province/State'])[['Confirmed', 'Deaths', 'Recovered', 'Active']].max()
total = covid_19.groupby('Date')[['Confirmed', 'Deaths', 'Recovered', 'Active']].sum().reset_index()
total = total[total['Date']==max(total['Date'])].reset_index(drop=True)
total['Global Mortality'] = total['Deaths']/total['Confirmed']
total['Deaths per 100 Confirmed Cases'] = total['Global Mortality']*100
total.style.background_gradient(cmap='inferno')
Now we can group by countries and display in order of confirmed cases.
The US has the highest number of reported infections, while Italy reports the most deaths. China reports the highest number of recovered cases, as it is at the most advanced stage of the contagion.
by_confirmed = latest_grouped.sort_values(by='Confirmed', ascending=False)
by_confirmed = by_confirmed[['Country/Region', 'Confirmed', 'Active', 'Deaths', 'Recovered']]
by_confirmed = by_confirmed.reset_index(drop=True)
by_confirmed.style.background_gradient(cmap="Blues", subset=['Confirmed'])\
.background_gradient(cmap="Oranges", subset=['Active'])\
.background_gradient(cmap="Greens", subset=['Recovered'])\
.background_gradient(cmap="Reds", subset=['Deaths'])
Let's do the same, this time only displaying overall deaths and calculating the mortality per country as:
Mortality rate = number of deaths / number of confirmed
The mortality of the virus varies greatly: Italy at almost 12% and Germany at about 1% are the two extremes in Europe.
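As a minimal sketch of the formula above, the snippet below computes deaths per 100 confirmed cases in plain Python. The country names and counts are invented for illustration and are not taken from the dataset; the helper `deaths_per_100_cases` is a hypothetical name, and it returns `None` when a country has no confirmed cases, since the ratio is undefined there.

```python
# Toy counts, invented for illustration only (not taken from the dataset).
toy_counts = {
    "Country A": {"Confirmed": 1000, "Deaths": 120},
    "Country B": {"Confirmed": 5000, "Deaths": 50},
    "Country C": {"Confirmed": 0, "Deaths": 0},
}


def deaths_per_100_cases(confirmed, deaths):
    """Return deaths per 100 confirmed cases, or None when there are no cases."""
    if confirmed == 0:
        return None  # the ratio is undefined without confirmed cases
    return round(deaths / confirmed * 100, 2)


rates = {
    country: deaths_per_100_cases(c["Confirmed"], c["Deaths"])
    for country, c in toy_counts.items()
}
print(rates)
```

Because the denominator is confirmed cases rather than all infections, the figure depends heavily on how widely each country tests.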
# .copy() so that adding a column does not trigger pandas' SettingWithCopyWarning
by_deaths = by_confirmed[by_confirmed['Deaths']>0][['Country/Region', 'Deaths']].copy()
by_deaths['Deaths / 100 Cases'] = ((by_confirmed['Deaths']/by_confirmed['Confirmed'])*100).round(2)
by_deaths.sort_values('Deaths', ascending=False).reset_index(drop=True).style.background_gradient(cmap='Reds')
# Deaths
temp = latest_grouped[latest_grouped['Deaths']>0]
fig = px.choropleth(temp,
locations="Country/Region", locationmode='country names',
color=np.log(temp["Deaths"]), hover_name="Country/Region",
color_continuous_scale="Peach", hover_data=['Deaths'],
title='Countries with Deaths Reported')
fig.update(layout_coloraxis_showscale=False)
fig.show()
Live visualisation with https://app.flourish.studio/visualisation/1714161/edit
HTML('''<div class="flourish-embed flourish-bar-chart-race" data-src="visualisation/1714161" data-url="https://public.flourish.studio/visualisation/1714161/embed"><script src="https://public.flourish.studio/resources/embed.js"></script></div>''')
Prophet is a forecasting procedure implemented in R and Python. At its core, the Prophet procedure is an additive regression model.
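For reference, the additive structure mentioned above can be written as follows. The symbols are not used elsewhere in this notebook and follow the usual description of Prophet: g(t) is the trend, s(t) the periodic (for example weekly) seasonality, h(t) the effect of holidays, and the last term is the error.

```latex
y(t) = g(t) + s(t) + h(t) + \varepsilon_t
```

With `growth='logistic'`, the trend g(t) is the part that saturates at the chosen cap.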
We will predict the curves of infections and deaths for the UK and for the USA. The two countries differ considerably in terms of population size, isolation measures and social distancing, with the UK adopting an approach more similar to Italy's and the US having a less restrictive system.
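# the package was renamed from fbprophet to prophet in v1.0; on older installs use: from fbprophet import Prophet
from prophet import Prophet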
Let's examine how our features will evolve over the next 30 days, starting with the number of confirmed cases in the UK.
data = ['Confirmed', 'Deaths', 'Recovered', 'Active']
df = covid_19.loc[covid_19['Country/Region'] == 'United Kingdom']
# sum only the case columns; text columns such as Country/Region cannot be summed meaningfully
df_uk = df.groupby('Date')[data].sum()
df_uk.reset_index(inplace=True)
df_uk['Date'] = pd.to_datetime(df_uk.Date)
df_uk.head()
By default, Prophet uses a linear model to forecast. When forecasting growth, there is usually some maximum achievable point: total market size, total population size, a virus spread etc. This is called the carrying capacity, and the forecast should saturate at this point.
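To make the idea of a carrying capacity concrete, here is a minimal sketch of a plain logistic curve that saturates at a cap. It is not Prophet's implementation, whose trend also allows the growth rate to change over time. The growth rate `K` and midpoint `T_MID` are hypothetical values chosen only for illustration; the cap is the 150,000 figure used for UK cases in this notebook.

```python
import math


def logistic_curve(t, cap, k, t_mid):
    """Logistic growth: approaches `cap` as t grows and equals cap / 2 at t_mid."""
    return cap / (1.0 + math.exp(-k * (t - t_mid)))


CAP = 150_000  # carrying capacity (the value this notebook uses for UK cases)
K = 0.2        # growth rate per day: hypothetical
T_MID = 40     # day of the inflection point: hypothetical

# Growth is close to exponential early on, then flattens out near the cap.
for day in (0, 20, 40, 60, 80, 120):
    print(day, round(logistic_curve(day, CAP, K, T_MID)))
```

The forecast can never exceed the cap, so the choice of cap has a strong influence on the predicted curve.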
The UK has a population size comparable to Italy's, and similar precautionary measures were taken to increase social distancing and enforce a lockdown. For these reasons, and because the measures were taken at a similar stage of the contagion, we are going to assume (and hope) that the total number of cases will level off between 100,000 and 150,000.
uk_confirmed = df_uk[['Date', 'Confirmed']].rename(columns={'Date': 'ds', 'Confirmed': 'y'})
uk_deaths = df_uk[['Date', 'Deaths']].rename(columns={'Date': 'ds', 'Deaths': 'y'})
uk_confirmed['cap'] = 150000
m = Prophet(growth='logistic')
m.fit(uk_confirmed)
future = m.make_future_dataframe(periods=30)
future['cap'] = 150000
future.tail()
Finally, let's predict the curve of infections for the next 30 days.
forecast = m.predict(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
fig1 = m.plot(forecast)
If the model is correct, the UK might reach its peak of infected population by the beginning of May and stabilise thereafter.
fig2 = m.plot_components(forecast)
Within the week, new recorded cases seem to peak towards the end of the week. This might be because hospitals usually take some time to investigate and record a new infection or a new death, so we should expect a similar pattern for recorded deaths.
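A rough way to cross-check a weekly reporting pattern without Prophet is to convert cumulative totals into daily new cases and average them by weekday. The sketch below does this in plain Python on invented numbers, so neither the dates nor the counts come from the dataset, and it is not how Prophet estimates its weekly component.

```python
from collections import defaultdict
from datetime import date, timedelta

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Hypothetical cumulative confirmed counts for 14 consecutive days (invented).
start = date(2020, 3, 2)  # a Monday
cumulative = [100, 110, 125, 140, 170, 210, 220, 235, 250, 270, 290, 340, 400, 415]

# Daily new cases are the differences between consecutive cumulative totals.
new_by_weekday = defaultdict(list)
for i in range(1, len(cumulative)):
    day = start + timedelta(days=i)
    new_by_weekday[WEEKDAYS[day.weekday()]].append(cumulative[i] - cumulative[i - 1])

# Average new cases per weekday; a reporting effect would show up as a
# consistently higher average on particular days.
weekday_means = {d: sum(v) / len(v) for d, v in new_by_weekday.items()}
for d, mean in weekday_means.items():
    print(f"{d}: {mean:.1f}")
```

With only a few weeks of data, such averages are noisy and should be read as a hint rather than evidence.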
uk_deaths['cap'] = 10500
At its current mortality rate of ~7%, we will cap the UK's deaths at 10,500, which corresponds to 7% of the 150,000-case cap. The curve indicates that the UK is set to reach approximately 5,000 deaths by next week. It is important to note that this growth rate could decrease if everybody respects the lockdown, limiting the number of new infections.
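The death cap appears to follow from applying the mortality rate to the cap on confirmed cases. The short sketch below reproduces that arithmetic and shows how sensitive the cap is to the assumed rate; the alternative rates of 5% and 10% are hypothetical and are included only for comparison.

```python
case_cap = 150_000     # cap on confirmed cases used for the UK in this notebook
mortality_rate = 0.07  # approximate UK deaths / confirmed ratio quoted in the text

death_cap = round(case_cap * mortality_rate)
print(death_cap)

# Sensitivity check with hypothetical alternative mortality rates.
alternative_caps = {rate: round(case_cap * rate) for rate in (0.05, 0.07, 0.10)}
print(alternative_caps)
```

A small change in the assumed mortality rate moves the cap by thousands of deaths, so the forecast should be read with that uncertainty in mind.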
m = Prophet(growth='logistic')
m.fit(uk_deaths)
future = m.make_future_dataframe(periods=14)
future['cap'] = 10500
fcst = m.predict(future)
fig = m.plot(fcst)
Now let's examine the USA. This country has a much larger population and a higher rate of infections. The US is currently adopting a protocol of social distancing but not the same restrictions, or lockdown, as the UK or other European countries. It is estimated that up to 200,000 people may die there as a result of covid-19 and that potentially millions will be infected.
For this reason we set a deliberately high cap of 1,500,000 confirmed cases, which should not constrain the forecast, as we won't predict that far ahead.
data = ['Confirmed', 'Deaths', 'Recovered', 'Active']
df = covid_19.loc[covid_19['Country/Region'] == 'US']
# sum only the case columns; text columns such as Country/Region cannot be summed meaningfully
df_us = df.groupby('Date')[data].sum()
df_us.reset_index(inplace=True)
df_us['Date'] = pd.to_datetime(df_us.Date)
df_us.tail()
us_confirmed = df_us[['Date', 'Confirmed']].rename(columns={'Date': 'ds', 'Confirmed': 'y'})
us_deaths = df_us[['Date', 'Deaths']].rename(columns={'Date': 'ds', 'Deaths': 'y'})
us_confirmed['cap'] = 1500000
us_confirmed.tail()
m = Prophet(growth='logistic')
m.fit(us_confirmed)
future = m.make_future_dataframe(periods=10)
future['cap'] = 1500000
future.tail()
forecast = m.predict(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
fig1 = m.plot(forecast)
Regarding the number of infections, the US could reach half a million infected citizens as early as next week unless social distancing takes effect sooner, which most consider unlikely.
In a recent interview, Trump said he wanted to keep the number of deaths below 200,000, but we will use the much lower value of 20,000 as a cap, so that the death curve can be compared directly with the UK's, considering that the US currently has a mortality rate of about 2%.
us_deaths['cap'] = 20000
m = Prophet(growth='logistic')
m.fit(us_deaths)
future = m.make_future_dataframe(periods=14)
future['cap'] = 20000
fcst = m.predict(future)
fig = m.plot(fcst)
As we can see, the rise is much sharper, with 5,000 fatalities by the second of April and 7,500 only a few days later.
There are several limitations to this model:
The lockdowns and social distancing recommendations adopted by many countries around the world will certainly have their positive effect, by slowing down the rate of new infections and deaths reported. This usually shows in the curve with a 2 to 3 weeks delay from the moment the measures were adopted, so predictions may be worse than actual values depending on how strict these measures are.
Different countries report new infections and deaths differently. The way tests are carried out can vary a lot, and hospital capacity is certainly a major factor that can greatly contribute to the mortality rate and should be accounted for.
We examined a dataset reporting the total confirmed covid-19 cases and deaths globally and per country. We were able to determine the overall mortality rate and to observe large differences among countries and continents.
The results seem to indicate that social distancing and national lockdowns do have a large impact on the total number of infections and deaths if these measures are taken in time.
Finally, after highlighting the limitations that a logistic model imposes on our predictions, we looked at two countries, the US and the UK. Despite their very different population sizes, densities, government guidelines and social distancing measures, we were able to plot the curves of infections and deaths and forecast these values for the next 14 to 30 days.